Lec04 - Wed 2/22: 5NG
What is a statistical graphic?
- Today we kick off Topic 2.b) Data Visualization by asking ourselves: What is a statistical graphic?
- But a brief lesson from military history first
Napoleon’s March on Russia in 1812
In 1812, Napoleon led a French invasion of Russia, marching on Moscow.
Napoleon’s March on Russia in 1812
It was one of the biggest military disasters ever, in large part because of the Russian winter.
Minard’s Illustration of the March
Famous graphical illustration of Napoleon’s march to and from Moscow
Minard’s Illustration of the March
This was considered a revolution in statistical graphics because between
- the map on top
- the line graph on the bottom
there are 6 dimensions of information (i.e. variables) being displayed on a 2D page.
The Grammar of Graphics
A statistical graphic is a mapping of data variables to aes()thetic attributes of geom_etric objects.
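As a minimal sketch of this definition, assuming the ggplot2 package is installed and using R's built-in mtcars data (not a data set from these slides), three data variables are mapped to three aesthetic attributes of one geometric object, the point:

```r
library(ggplot2)

# Map data variables to aesthetics inside aes(), then pick a geometric object.
# mtcars ships with R; wt = weight, mpg = miles per gallon, cyl = cylinders.
p <- ggplot(data = mtcars,
            mapping = aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point()

p  # printing the object draws the graphic
```

Changing the geom or the aes() mapping changes the graphic while the data stays the same.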
Minard’s Illustration of the March
| Part of graphic | Data variable | aes()thetic | geom_etry |
| top map | longitude | x | point |
| “ | latitude | y | point |
| “ | army size | size | path |
| “ | army direction (forward vs retreat) | color | path |
| bottom graph | date | x | line & text |
| “ | temperature | y | line & text |
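The top-map rows of this table could be sketched in ggplot2 roughly as follows. The data frame below holds toy values invented purely for illustration (it is not Minard's data), and the column names are hypothetical. For lines, ggplot2 3.4.0 and later spells the size aesthetic as linewidth, so this sketch assumes such a version.

```r
library(ggplot2)

# Toy values invented for illustration only; NOT Minard's actual data.
troops <- data.frame(
  long      = c(24, 28, 32, 36, 32, 28, 24),
  lat       = c(55.0, 55.5, 55.0, 55.8, 54.5, 54.2, 54.5),
  survivors = c(400, 300, 200, 100, 60, 30, 10),
  direction = c("advance", "advance", "advance", "advance",
                "retreat", "retreat", "retreat")
)

# longitude -> x, latitude -> y, army size -> line width, direction -> color,
# all drawn with a path geom, as in the table above.
top_map <- ggplot(troops, aes(x = long, y = lat)) +
  geom_path(aes(linewidth = survivors, color = direction))

top_map
```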
Name this Graph
From ggplot2movies package, the movies data set:

Name this Graph
From nycflights13 package, the flights data set:

Name this Graph
From fueleconomy package, the vehicles data set:

Name this Graph
From babynames package, the babynames data set:

Name this Graph
From okcupiddata package, the profiles data set:

5NG
Say hello to the 5NG: the five named graphs
- Scatterplot AKA bivariate plot
- Line-graph
- Boxplot
- Barplot AKA Barchart AKA bargraph
- Histogram
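Each of the five named graphs corresponds to a geom in ggplot2. The sketch below assumes ggplot2 is installed and uses the mtcars data built into R and the economics data that ships with ggplot2, rather than the packages shown on the earlier slides:

```r
library(ggplot2)

# 1. Scatterplot: two numerical variables
p_scatter <- ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()

# 2. Line graph: a numerical variable over time
p_line <- ggplot(economics, aes(x = date, y = unemploy)) + geom_line()

# 3. Boxplot: a numerical variable split by a categorical variable
p_box <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) + geom_boxplot()

# 4. Barplot: counts of a categorical variable
p_bar <- ggplot(mtcars, aes(x = factor(cyl))) + geom_bar()

# 5. Histogram: distribution of one numerical variable
p_hist <- ggplot(mtcars, aes(x = mpg)) + geom_histogram(bins = 10)
```

Typing the name of any of these objects (for example p_scatter) in the console draws it.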
Lec02 - Thu 2/16: R Packages
Exercise
In small teams, take 3 minutes to write down
- A couple of male and female names that are “modern”
- A couple of male and female names that are “old-fashioned”
- One male and one female name that are “back in vogue”
Learning R
- Computers are stupid! You need to:
- Tell them exactly what to do, and everything they need to do
- Everything needs to be perfect:
- Write everything from scratch
- Names of “stuff” need to be typed exactly
- Parentheses need to match
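A small illustration of how literal R is about names, using a made-up variable name (my_number is hypothetical, chosen only for this example):

```r
# Create an object called my_number
my_number <- 5

# R is case-sensitive: my_number exists, My_Number does not
name_found       <- exists("my_number")   # TRUE
wrong_case_found <- exists("My_Number")   # FALSE

# Parentheses must match: sqrt(my_number) runs,
# while typing sqrt(my_number without the closing ")" leaves R waiting for more input
root <- sqrt(my_number)
```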
- Recall: This is not a class on programming/coding. However, we’ll learn just enough to do statistics and data science
- Side Benefit: Many of the concepts translate to almost all programming languages: python, javascript, etc.
Learning R
Recall the tradeoff:
What are R Packages?
- Base R, i.e. R straight out of the box. It’s fairly limited in power and functionality.
- R packages are extensions to R that
- are contributed by a world-wide community of R users
- extend base R’s functionality
- can be downloaded over the internet from within RStudio.
Step 1: How Do I Install a Package?
You need to install each package once.
- In RStudio: Go to Files Panel -> Packages -> Install
- Type in the package name and click install
- The procedure for updating a package is the same
Step 2: How Do I Load a Package?
You need to load a package every time you want to use it.
- Run library(PACKAGENAME) in the console.
Baby’s First R Packages
Today’s Learning Check: Install and then load 3 packages:
- dplyr: a package for data manipulation
- ggplot2: a package for data visualization
- babynames: a package of baby name data
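The same two steps can also be typed in the console. This is a sketch of the console equivalent of the RStudio menu route, not a replacement for it; the install line is commented out because it only needs to run once and requires an internet connection:

```r
# Step 1 (once per computer): install the packages
# install.packages(c("dplyr", "ggplot2", "babynames"))

# Step 2 (every R session): load the packages
library(dplyr)
library(ggplot2)
library(babynames)
```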
babynames Package
For each year from 1880 to 2013, the babynames package contains the number of children of each sex given each name in the United States. Only names with at least 5 occurrences are included.
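One way to explore this data, sketched under the assumption that all three packages from the Learning Check are installed. The name "Emma" is an arbitrary choice for illustration; swap in any name from the team exercise:

```r
library(dplyr)
library(ggplot2)
library(babynames)

# Keep only the rows for one name and one sex.
# The babynames data frame has columns year, sex, name, n, and prop.
emma <- babynames %>%
  filter(name == "Emma", sex == "F")

# Line graph of the number of babies given this name in each year
emma_plot <- ggplot(emma, aes(x = year, y = n)) +
  geom_line()

emma_plot
```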
Lec01 - Mon 2/13: Introduction
Course Title
- In catalog: Introduction to Statistical Sciences
- New: Introduction to Statistical and Data Sciences
Data Science
- Example domains: biology, economics, physics, sociology, etc.
- So why the title switch?
Course Objective #1
Have students engage in the data/science research pipeline in as faithful a manner as possible while maintaining a level suitable for novices.
- Cobb: Minimizing prerequisites to research
- Not necessarily publishing in top journals, but answering scientific questions with data.
- Difficult to do research without understanding stats, however
Data/Science Research Pipeline
We will, as best we can, perform all this:
Data/Science Research Pipeline
And not just this, as in many previous intro stats courses:
Course Objective #2
Foster a conceptual understanding of statistical topics and methods using simulation/resampling and real data whenever possible, rather than mathematical formulae.
- Whenever we can, use real data
- Example data set: nycflights13
- There are two “engines” that can make statistics “work”
- Mathematics: formulas, approximations, etc
- Computers: simulations, random number generation
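To make the two engines concrete, here is a small sketch using only base R. The question (how likely are 8 or more heads in 10 fair coin flips?) is a made-up example, not one from the course materials. The computer engine simulates many sets of flips, while the mathematics engine uses the binomial formula:

```r
set.seed(76)  # arbitrary seed so the simulation is reproducible

# Computer engine: simulate 10,000 sets of 10 fair coin flips
# and record the number of heads in each set
heads <- rbinom(n = 10000, size = 10, prob = 0.5)
simulated <- mean(heads >= 8)   # proportion of sets with 8 or more heads

# Mathematics engine: exact binomial probability P(X >= 8) = 56/1024
exact <- pbinom(7, size = 10, prob = 0.5, lower.tail = FALSE)

c(simulated = simulated, exact = exact)
```

The two numbers should land close to each other, and the simulated one gets closer as the number of simulated sets grows.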
The “Engine” of Statistics
In this course, computers and not math will be the “engine”. What does this mean?
- Less of this:

- But more of this:

Programming/Coding
- Previous programming/coding experience is not a prerequisite to this course
- This course is not explicitly a course on programming, coding, or computer science, but we will use some elements of each.
- Also you will be exposed to basic algorithmic thinking and computational logic
- Learning R is like learning a foreign language: it’s really hard at first!
Two Simple Rules of Learning Code
- Computers are stupid!
- When learning, take existing code that works, and tweak it!
Course Objective #3
Blur the traditional lecture/lab dichotomy of introductory statistics courses by incorporating more computational and algorithmic thinking into the syllabus.
- Completely separate lecture and labs is a legacy of a time before

RStudio Server
- Not all laptops are created equal: operating system, processing power, age
- RStudio Server: cloud-based version of RStudio where all processing is done on Middlebury servers
- go/rstudio/ (on campus or via VPN)
Course Objective #5
Develop statistical literacy by, among other ways, tying in the curriculum to current events, demonstrating the importance statistics plays in society.
- H.G. Wells (paraphrased): “Statistical thinking will one day be as necessary for efficient citizenship as the ability to read and write.”
- Me: “Sure, it’s easy to lie with statistics. But it’s also hard to tell the truth without them.”
Final Project
- Capstone experience to align the topics and principles of this course with how research and learning are done in practice.
- Work on interpersonal and collaborative skills. No textbook on that!
R, RStudio, and DataCamp
- R: Software behind the scenes i.e. the engine
- RStudio: Integrated development environment i.e. the interface
- DataCamp: Browser-based learning tool i.e. the driver’s ed teacher
Test Drive RStudio
- Log in to go/rstudio/ with your Midd account
- If you don’t have access, raise your hand. (Username: guest1, password: rstudioguest)
- In RStudio menu bar -> File -> New File -> R Script
The Four Panels
- Console: Crunch numbers in R
- Files, Packages, Help: See your files, install packages, help files
- Editor: Where you’ll write code and save it
- Environment: Your workspace
Important: Console
- This is where you run/execute commands
- The “>” is the prompt. It means R is ready to receive commands
- If you don’t see a “>” and want to restart, press ESC.
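A first command to try at the prompt might look like this (the object name total is arbitrary). Type each line after the “>” and press Enter:

```r
# Add two numbers and store the result in an object called total
total <- 2 + 3

# Typing the object's name prints its value in the console
total
```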
Switching Gears
Now we will use R via DataCamp instead of via RStudio, but just for driver’s ed. Two panels exist in both:
- Editor panel: Where you write code
- Console panel: Where you will execute code